• DOMAIN: Semiconductor manufacturing process
• CONTEXT: A complex modern semiconductor manufacturing process is normally under constant surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information and noise. Engineers typically have many more signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The Process Engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning and reduced per-unit production costs. These signals can be used as features to predict the yield type. By analysing and trying out different combinations of features, the essential signals that impact the yield type can be identified.
• DATA DESCRIPTION: sensor-data.csv : (1567, 592). The data consists of 1567 datapoints, each with 591 features. The dataset presented in this case represents a selection of such features, where each example represents a single production entity with associated measured features, and the labels represent a simple pass/fail yield for in-house line testing. In the target column, “–1” corresponds to a pass and “1” corresponds to a fail, and the data time stamp is for that specific test point.
• PROJECT OBJECTIVE: We will build a classifier to predict the Pass/Fail yield of a particular process entity and analyse whether all the features are required to build the model or not.
# Load the Libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
class color:
    BOLD = '\033[1m'
    END = '\033[0m'
1 Import and understand the data
Q 1A Import ‘signal-data.csv’ as DataFrame
Ans 1A
# Load the signals csv file into a dataframe
signals_df = pd.read_csv( 'FMT_Project/signal_data.csv' )
Q 1B. Print 5 point summary and share at least 2 observations
Ans 1B.
# use describe method to display 5 point summary
signals_df.describe()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1561.000000 | 1560.000000 | 1553.000000 | 1553.000000 | 1553.000000 | 1553.0 | 1553.000000 | 1558.000000 | 1565.000000 | 1565.000000 | ... | 618.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1566.000000 | 1567.000000 |
| mean | 3014.452896 | 2495.850231 | 2200.547318 | 1396.376627 | 4.197013 | 100.0 | 101.112908 | 0.121822 | 1.462862 | -0.000841 | ... | 97.934373 | 0.500096 | 0.015318 | 0.003847 | 3.067826 | 0.021458 | 0.016475 | 0.005283 | 99.670066 | -0.867262 |
| std | 73.621787 | 80.407705 | 29.513152 | 441.691640 | 56.355540 | 0.0 | 6.237214 | 0.008961 | 0.073897 | 0.015116 | ... | 87.520966 | 0.003404 | 0.017180 | 0.003720 | 3.578033 | 0.012358 | 0.008808 | 0.002867 | 93.891919 | 0.498010 |
| min | 2743.240000 | 2158.750000 | 2060.660000 | 0.000000 | 0.681500 | 100.0 | 82.131100 | 0.000000 | 1.191000 | -0.053400 | ... | 0.000000 | 0.477800 | 0.006000 | 0.001700 | 1.197500 | -0.016900 | 0.003200 | 0.001000 | 0.000000 | -1.000000 |
| 25% | 2966.260000 | 2452.247500 | 2181.044400 | 1081.875800 | 1.017700 | 100.0 | 97.920000 | 0.121100 | 1.411200 | -0.010800 | ... | 46.184900 | 0.497900 | 0.011600 | 0.003100 | 2.306500 | 0.013425 | 0.010600 | 0.003300 | 44.368600 | -1.000000 |
| 50% | 3011.490000 | 2499.405000 | 2201.066700 | 1285.214400 | 1.316800 | 100.0 | 101.512200 | 0.122400 | 1.461600 | -0.001300 | ... | 72.288900 | 0.500200 | 0.013800 | 0.003600 | 2.757650 | 0.020500 | 0.014800 | 0.004600 | 71.900500 | -1.000000 |
| 75% | 3056.650000 | 2538.822500 | 2218.055500 | 1591.223500 | 1.525700 | 100.0 | 104.586700 | 0.123800 | 1.516900 | 0.008400 | ... | 116.539150 | 0.502375 | 0.016500 | 0.004100 | 3.295175 | 0.027600 | 0.020300 | 0.006400 | 114.749700 | -1.000000 |
| max | 3356.350000 | 2846.440000 | 2315.266700 | 3715.041700 | 1114.536600 | 100.0 | 129.252200 | 0.128600 | 1.656400 | 0.074900 | ... | 737.304800 | 0.509800 | 0.476600 | 0.104500 | 99.303200 | 0.102800 | 0.079900 | 0.028600 | 737.304800 | 1.000000 |
8 rows × 591 columns
# Capture the column dtypes and count how many columns have each data type
dt = signals_df.dtypes
dt.value_counts()
float64    590
object       1
int64        1
dtype: int64
# print the top 5 rows of the dataframe
signals_df.head()
| | Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
5 rows × 592 columns
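The dtype counts above show one object column, which is the Time stamp read in as text. A minimal sketch of parsing it follows, assuming the format seen in the first rows; the `toy` frame and the `gap_minutes` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the first rows of the data; only 'Time' matters here.
toy = pd.DataFrame({
    "Time": ["2008-07-19 11:55:00", "2008-07-19 12:32:00", "2008-07-19 13:17:00"],
    "Pass/Fail": [-1, -1, 1],
})

# Convert the text timestamps to datetime64 so they can be sorted or differenced.
toy["Time"] = pd.to_datetime(toy["Time"], format="%Y-%m-%d %H:%M:%S")

# Minutes elapsed between consecutive test points (first entry is NaN).
gap_minutes = toy["Time"].diff().dt.total_seconds() / 60

print(toy.dtypes)
print(gap_minutes.tolist())
```

The same `pd.to_datetime` call could be applied to `signals_df['Time']` if the time stamp is needed later; for a pure pass/fail classifier it may simply be dropped.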
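Observations
• Feature 5 is constant: its min, max and every quartile equal 100.0 and its standard deviation is 0, so it cannot help separate pass from fail.
• The non-null counts differ by column (for example 1561 for feature 0 but only 618 for feature 581, out of 1567 rows), so missing values need handling before modelling.
• The target is imbalanced: a Pass/Fail mean of -0.867262 on labels of -1 and 1 corresponds to about 6.6% fails (roughly 104 of 1567).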
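A few quick checks tend to surface observations from a summary like the one printed by describe(): constant columns, missing-value counts and class balance. The sketch below runs them on a small made-up frame; `toy`, `constant_cols`, `missing_counts` and `fail_share` are hypothetical names, and the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for signals_df (hypothetical values).
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [100.0, 100.0, 100.0, 100.0, 100.0],  # constant column
    "Pass/Fail": [-1, -1, -1, 1, -1],
})

features = toy.drop(columns="Pass/Fail")

# Columns with at most one distinct non-null value carry no signal.
constant_cols = features.columns[features.nunique() <= 1].tolist()

# Number of missing entries per feature.
missing_counts = features.isna().sum()

# Share of fails; with labels in {-1, 1} this equals (1 + mean) / 2.
fail_share = (toy["Pass/Fail"] == 1).mean()

print(constant_cols)
print(missing_counts.to_dict())
print(fail_share)
```

The identity in the last comment is what lets the class balance be read directly off the mean row of the summary table.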
Q 2A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature
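To make the 20% rule concrete, here is a small vectorised sketch on made-up data. It is an illustration of the rule only, not the for loop the question asks for; the `toy` frame and the column names x, y and z are hypothetical.

```python
import numpy as np
import pandas as pd

# Ten rows, so each missing value is 10% of a column.
toy = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],  # 30% missing
    "y": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],            # complete
    "z": [np.nan, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],         # 10% missing
})

# Fraction of missing values per column.
null_fraction = toy.isna().mean()

# Drop every column at or above the 20% threshold.
to_drop = null_fraction[null_fraction >= 0.2].index
cleaned = toy.drop(columns=to_drop)

# Fill the remaining gaps with each column's own mean.
cleaned = cleaned.fillna(cleaned.mean())

print(list(cleaned.columns))
print(cleaned["z"].tolist())
```

Here x is dropped, y is left as is, and the single gap in z is filled with the mean of its observed values.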
Ans 2A
# Capture the number of records in the dataframe to calculate the missing-value ratio.
df_len = len(signals_df)
# Get the list of column names in the dataframe to iterate over
collist = signals_df.columns
# Create an independent copy of the dataframe (plain assignment would only alias it)
signals_df_cp = signals_df.copy()
# Iterate over the column list
for colnm in collist:
    # Count the missing values; isna() covers both None and NaN
    nan_len = signals_df_cp[colnm].isna().sum()
    # Drop the feature if 20% or more of its values are missing
    if (nan_len / df_len) >= 0.2:
        print( 'Dropping feature : ', colnm )
        signals_df_cp = signals_df_cp.drop( [colnm], axis = 1 )
    # Otherwise impute any missing values with the mean of the feature
    elif nan_len != 0:
        print( 'Setting mean values in rows having null or nan value for feature : ', colnm )
        signals_df_cp[ colnm ] = signals_df_cp[ colnm ].fillna( signals_df_cp[ colnm ].mean() )
Setting mean values in rows having null or nan value for feature : 0 Setting mean values in rows having null or nan value for feature : 1 Setting mean values in rows having null or nan value for feature : 2 Setting mean values in rows having null or nan value for feature : 3 Setting mean values in rows having null or nan value for feature : 4 Setting mean values in rows having null or nan value for feature : 5 Setting mean values in rows having null or nan value for feature : 6 Setting mean values in rows having null or nan value for feature : 7 Setting mean values in rows having null or nan value for feature : 8 Setting mean values in rows having null or nan value for feature : 9 Setting mean values in rows having null or nan value for feature : 10 Setting mean values in rows having null or nan value for feature : 11 Setting mean values in rows having null or nan value for feature : 12 Setting mean values in rows having null or nan value for feature : 13 Setting mean values in rows having null or nan value for feature : 14 Setting mean values in rows having null or nan value for feature : 15 Setting mean values in rows having null or nan value for feature : 16 Setting mean values in rows having null or nan value for feature : 17 Setting mean values in rows having null or nan value for feature : 18 Setting mean values in rows having null or nan value for feature : 19 Setting mean values in rows having null or nan value for feature : 21 Setting mean values in rows having null or nan value for feature : 22 Setting mean values in rows having null or nan value for feature : 23 Setting mean values in rows having null or nan value for feature : 24 Setting mean values in rows having null or nan value for feature : 25 Setting mean values in rows having null or nan value for feature : 26 Setting mean values in rows having null or nan value for feature : 27 Setting mean values in rows having null or nan value for feature : 28 Setting mean values in rows having null or nan 
value for feature : 29 Setting mean values in rows having null or nan value for feature : 30 Setting mean values in rows having null or nan value for feature : 31 Setting mean values in rows having null or nan value for feature : 32 Setting mean values in rows having null or nan value for feature : 33 Setting mean values in rows having null or nan value for feature : 34 Setting mean values in rows having null or nan value for feature : 35 Setting mean values in rows having null or nan value for feature : 36 Setting mean values in rows having null or nan value for feature : 37 Setting mean values in rows having null or nan value for feature : 38 Setting mean values in rows having null or nan value for feature : 39 Setting mean values in rows having null or nan value for feature : 40 Setting mean values in rows having null or nan value for feature : 41 Setting mean values in rows having null or nan value for feature : 42 Setting mean values in rows having null or nan value for feature : 43 Setting mean values in rows having null or nan value for feature : 44 Setting mean values in rows having null or nan value for feature : 45 Setting mean values in rows having null or nan value for feature : 46 Setting mean values in rows having null or nan value for feature : 47 Setting mean values in rows having null or nan value for feature : 48 Setting mean values in rows having null or nan value for feature : 49 Setting mean values in rows having null or nan value for feature : 50 Setting mean values in rows having null or nan value for feature : 51 Setting mean values in rows having null or nan value for feature : 52 Setting mean values in rows having null or nan value for feature : 53 Setting mean values in rows having null or nan value for feature : 54 Setting mean values in rows having null or nan value for feature : 55 Setting mean values in rows having null or nan value for feature : 56 Setting mean values in rows having null or nan value for feature : 57 Setting mean 
values in rows having null or nan value for feature : 58 Setting mean values in rows having null or nan value for feature : 59 Setting mean values in rows having null or nan value for feature : 60 Setting mean values in rows having null or nan value for feature : 61 Setting mean values in rows having null or nan value for feature : 62 Setting mean values in rows having null or nan value for feature : 63 Setting mean values in rows having null or nan value for feature : 64 Setting mean values in rows having null or nan value for feature : 65 Setting mean values in rows having null or nan value for feature : 66 Setting mean values in rows having null or nan value for feature : 67 Setting mean values in rows having null or nan value for feature : 68 Setting mean values in rows having null or nan value for feature : 69 Setting mean values in rows having null or nan value for feature : 70 Setting mean values in rows having null or nan value for feature : 71 Dropping feature : 72 Dropping feature : 73 Setting mean values in rows having null or nan value for feature : 74 Setting mean values in rows having null or nan value for feature : 75 Setting mean values in rows having null or nan value for feature : 76 Setting mean values in rows having null or nan value for feature : 77 Setting mean values in rows having null or nan value for feature : 78 Setting mean values in rows having null or nan value for feature : 79 Setting mean values in rows having null or nan value for feature : 80 Setting mean values in rows having null or nan value for feature : 81 Setting mean values in rows having null or nan value for feature : 82 Setting mean values in rows having null or nan value for feature : 83 Setting mean values in rows having null or nan value for feature : 84 Dropping feature : 85 Setting mean values in rows having null or nan value for feature : 89 Setting mean values in rows having null or nan value for feature : 90 Setting mean values in rows having null or nan value for 
feature : 91 Setting mean values in rows having null or nan value for feature : 92 Setting mean values in rows having null or nan value for feature : 93 Setting mean values in rows having null or nan value for feature : 94 Setting mean values in rows having null or nan value for feature : 95 Setting mean values in rows having null or nan value for feature : 96 Setting mean values in rows having null or nan value for feature : 97 Setting mean values in rows having null or nan value for feature : 98 Setting mean values in rows having null or nan value for feature : 99 Setting mean values in rows having null or nan value for feature : 100 Setting mean values in rows having null or nan value for feature : 101 Setting mean values in rows having null or nan value for feature : 102 Setting mean values in rows having null or nan value for feature : 103 Setting mean values in rows having null or nan value for feature : 104 Setting mean values in rows having null or nan value for feature : 105 Setting mean values in rows having null or nan value for feature : 106 Setting mean values in rows having null or nan value for feature : 107 Setting mean values in rows having null or nan value for feature : 108 Dropping feature : 109 Dropping feature : 110 Dropping feature : 111 Dropping feature : 112 Setting mean values in rows having null or nan value for feature : 118 Setting mean values in rows having null or nan value for feature : 121 Setting mean values in rows having null or nan value for feature : 122 Setting mean values in rows having null or nan value for feature : 123 Setting mean values in rows having null or nan value for feature : 124 Setting mean values in rows having null or nan value for feature : 125 Setting mean values in rows having null or nan value for feature : 126 Setting mean values in rows having null or nan value for feature : 127 Setting mean values in rows having null or nan value for feature : 128 Setting mean values in rows having null or nan value for 
feature : 129 Setting mean values in rows having null or nan value for feature : 130 Setting mean values in rows having null or nan value for feature : 131 Setting mean values in rows having null or nan value for feature : 132 Setting mean values in rows having null or nan value for feature : 133 Setting mean values in rows having null or nan value for feature : 134 Setting mean values in rows having null or nan value for feature : 135 Setting mean values in rows having null or nan value for feature : 136 Setting mean values in rows having null or nan value for feature : 137 Setting mean values in rows having null or nan value for feature : 138 Setting mean values in rows having null or nan value for feature : 139 Setting mean values in rows having null or nan value for feature : 140 Setting mean values in rows having null or nan value for feature : 141 Setting mean values in rows having null or nan value for feature : 142 Setting mean values in rows having null or nan value for feature : 143 Setting mean values in rows having null or nan value for feature : 144 Setting mean values in rows having null or nan value for feature : 145 Setting mean values in rows having null or nan value for feature : 146 Setting mean values in rows having null or nan value for feature : 147 Setting mean values in rows having null or nan value for feature : 148 Setting mean values in rows having null or nan value for feature : 149 Setting mean values in rows having null or nan value for feature : 150 Setting mean values in rows having null or nan value for feature : 151 Setting mean values in rows having null or nan value for feature : 152 Setting mean values in rows having null or nan value for feature : 153 Setting mean values in rows having null or nan value for feature : 154 Setting mean values in rows having null or nan value for feature : 155 Dropping feature : 157 Dropping feature : 158 Setting mean values in rows having null or nan value for feature : 159 Setting mean values in 
rows having null or nan value for feature : 160 Setting mean values in rows having null or nan value for feature : 161 Setting mean values in rows having null or nan value for feature : 162 Setting mean values in rows having null or nan value for feature : 163 Setting mean values in rows having null or nan value for feature : 164 Setting mean values in rows having null or nan value for feature : 165 Setting mean values in rows having null or nan value for feature : 166 Setting mean values in rows having null or nan value for feature : 167 Setting mean values in rows having null or nan value for feature : 168 Setting mean values in rows having null or nan value for feature : 169 Setting mean values in rows having null or nan value for feature : 170 Setting mean values in rows having null or nan value for feature : 171 Setting mean values in rows having null or nan value for feature : 172 Setting mean values in rows having null or nan value for feature : 173 Setting mean values in rows having null or nan value for feature : 174 Setting mean values in rows having null or nan value for feature : 175 Setting mean values in rows having null or nan value for feature : 176 Setting mean values in rows having null or nan value for feature : 177 Setting mean values in rows having null or nan value for feature : 178 Setting mean values in rows having null or nan value for feature : 179 Setting mean values in rows having null or nan value for feature : 180 Setting mean values in rows having null or nan value for feature : 181 Setting mean values in rows having null or nan value for feature : 182 Setting mean values in rows having null or nan value for feature : 183 Setting mean values in rows having null or nan value for feature : 184 Setting mean values in rows having null or nan value for feature : 185 Setting mean values in rows having null or nan value for feature : 186 Setting mean values in rows having null or nan value for feature : 187 Setting mean values in rows having 
null or nan value for feature : 188 Setting mean values in rows having null or nan value for feature : 189 Setting mean values in rows having null or nan value for feature : 190 Setting mean values in rows having null or nan value for feature : 191 Setting mean values in rows having null or nan value for feature : 192 Setting mean values in rows having null or nan value for feature : 193 Setting mean values in rows having null or nan value for feature : 194 Setting mean values in rows having null or nan value for feature : 195 Setting mean values in rows having null or nan value for feature : 196 Setting mean values in rows having null or nan value for feature : 197 Setting mean values in rows having null or nan value for feature : 198 Setting mean values in rows having null or nan value for feature : 199 Setting mean values in rows having null or nan value for feature : 200 Setting mean values in rows having null or nan value for feature : 201 Setting mean values in rows having null or nan value for feature : 202 Setting mean values in rows having null or nan value for feature : 203 Setting mean values in rows having null or nan value for feature : 204 Setting mean values in rows having null or nan value for feature : 205 Setting mean values in rows having null or nan value for feature : 206 Setting mean values in rows having null or nan value for feature : 207 Setting mean values in rows having null or nan value for feature : 208 Setting mean values in rows having null or nan value for feature : 209 Setting mean values in rows having null or nan value for feature : 210 Setting mean values in rows having null or nan value for feature : 211 Setting mean values in rows having null or nan value for feature : 212 Setting mean values in rows having null or nan value for feature : 213 Setting mean values in rows having null or nan value for feature : 214 Setting mean values in rows having null or nan value for feature : 215 Setting mean values in rows having null or nan 
value for feature : 216 Setting mean values in rows having null or nan value for feature : 217 Setting mean values in rows having null or nan value for feature : 218 Setting mean values in rows having null or nan value for feature : 219 Dropping feature : 220 Setting mean values in rows having null or nan value for feature : 224 Setting mean values in rows having null or nan value for feature : 225 Setting mean values in rows having null or nan value for feature : 226 Setting mean values in rows having null or nan value for feature : 227 Setting mean values in rows having null or nan value for feature : 228 Setting mean values in rows having null or nan value for feature : 229 Setting mean values in rows having null or nan value for feature : 230 Setting mean values in rows having null or nan value for feature : 231 Setting mean values in rows having null or nan value for feature : 232 Setting mean values in rows having null or nan value for feature : 233 Setting mean values in rows having null or nan value for feature : 234 Setting mean values in rows having null or nan value for feature : 235 Setting mean values in rows having null or nan value for feature : 236 Setting mean values in rows having null or nan value for feature : 237 Setting mean values in rows having null or nan value for feature : 238 Setting mean values in rows having null or nan value for feature : 239 Setting mean values in rows having null or nan value for feature : 240 Setting mean values in rows having null or nan value for feature : 241 Setting mean values in rows having null or nan value for feature : 242 Setting mean values in rows having null or nan value for feature : 243 Dropping feature : 244 Dropping feature : 245 Dropping feature : 246 Dropping feature : 247 Setting mean values in rows having null or nan value for feature : 253 Setting mean values in rows having null or nan value for feature : 256 Setting mean values in rows having null or nan value for feature : 257 Setting mean 
values in rows having null or nan value for feature : 258 Setting mean values in rows having null or nan value for feature : 259 Setting mean values in rows having null or nan value for feature : 260 Setting mean values in rows having null or nan value for feature : 261 Setting mean values in rows having null or nan value for feature : 262 Setting mean values in rows having null or nan value for feature : 263 Setting mean values in rows having null or nan value for feature : 264 Setting mean values in rows having null or nan value for feature : 265 Setting mean values in rows having null or nan value for feature : 266 Setting mean values in rows having null or nan value for feature : 267 Setting mean values in rows having null or nan value for feature : 268 Setting mean values in rows having null or nan value for feature : 269 Setting mean values in rows having null or nan value for feature : 270 Setting mean values in rows having null or nan value for feature : 271 Setting mean values in rows having null or nan value for feature : 272 Setting mean values in rows having null or nan value for feature : 273 Setting mean values in rows having null or nan value for feature : 274 Setting mean values in rows having null or nan value for feature : 275 Setting mean values in rows having null or nan value for feature : 276 Setting mean values in rows having null or nan value for feature : 277 Setting mean values in rows having null or nan value for feature : 278 Setting mean values in rows having null or nan value for feature : 279 Setting mean values in rows having null or nan value for feature : 280 Setting mean values in rows having null or nan value for feature : 281 Setting mean values in rows having null or nan value for feature : 282 Setting mean values in rows having null or nan value for feature : 283 Setting mean values in rows having null or nan value for feature : 284 Setting mean values in rows having null or nan value for feature : 285 Setting mean values in 
rows having null or nan value for feature : 286 Setting mean values in rows having null or nan value for feature : 287 Setting mean values in rows having null or nan value for feature : 288 Setting mean values in rows having null or nan value for feature : 289 Setting mean values in rows having null or nan value for feature : 290 Dropping feature : 292 Dropping feature : 293 Setting mean values in rows having null or nan value for feature : 294 Setting mean values in rows having null or nan value for feature : 295 Setting mean values in rows having null or nan value for feature : 296 Setting mean values in rows having null or nan value for feature : 297 Setting mean values in rows having null or nan value for feature : 298 Setting mean values in rows having null or nan value for feature : 299 Setting mean values in rows having null or nan value for feature : 300 Setting mean values in rows having null or nan value for feature : 301 Setting mean values in rows having null or nan value for feature : 302 Setting mean values in rows having null or nan value for feature : 303 Setting mean values in rows having null or nan value for feature : 304 Setting mean values in rows having null or nan value for feature : 305 Setting mean values in rows having null or nan value for feature : 306 Setting mean values in rows having null or nan value for feature : 307 Setting mean values in rows having null or nan value for feature : 308 Setting mean values in rows having null or nan value for feature : 309 Setting mean values in rows having null or nan value for feature : 310 Setting mean values in rows having null or nan value for feature : 311 Setting mean values in rows having null or nan value for feature : 312 Setting mean values in rows having null or nan value for feature : 313 Setting mean values in rows having null or nan value for feature : 314 Setting mean values in rows having null or nan value for feature : 315 Setting mean values in rows having null or nan value for 
feature : 316 Setting mean values in rows having null or nan value for feature : 317 Setting mean values in rows having null or nan value for feature : 318 Setting mean values in rows having null or nan value for feature : 319 Setting mean values in rows having null or nan value for feature : 320 Setting mean values in rows having null or nan value for feature : 321 Setting mean values in rows having null or nan value for feature : 322 Setting mean values in rows having null or nan value for feature : 323 Setting mean values in rows having null or nan value for feature : 324 Setting mean values in rows having null or nan value for feature : 325 Setting mean values in rows having null or nan value for feature : 326 Setting mean values in rows having null or nan value for feature : 327 Setting mean values in rows having null or nan value for feature : 328 Setting mean values in rows having null or nan value for feature : 329 Setting mean values in rows having null or nan value for feature : 330 Setting mean values in rows having null or nan value for feature : 331 Setting mean values in rows having null or nan value for feature : 332 Setting mean values in rows having null or nan value for feature : 333 Setting mean values in rows having null or nan value for feature : 334 Setting mean values in rows having null or nan value for feature : 335 Setting mean values in rows having null or nan value for feature : 336 Setting mean values in rows having null or nan value for feature : 337 Setting mean values in rows having null or nan value for feature : 338 Setting mean values in rows having null or nan value for feature : 339 Setting mean values in rows having null or nan value for feature : 340 Setting mean values in rows having null or nan value for feature : 341 Setting mean values in rows having null or nan value for feature : 342 Setting mean values in rows having null or nan value for feature : 343 Setting mean values in rows having null or nan value for feature : 
344 Dropping feature : 345 Dropping feature : 346 Setting mean values in rows having null or nan value for feature : 347 Setting mean values in rows having null or nan value for feature : 348 Setting mean values in rows having null or nan value for feature : 349 Setting mean values in rows having null or nan value for feature : 350 Setting mean values in rows having null or nan value for feature : 351 Setting mean values in rows having null or nan value for feature : 352 Setting mean values in rows having null or nan value for feature : 353 Setting mean values in rows having null or nan value for feature : 354 Setting mean values in rows having null or nan value for feature : 355 Setting mean values in rows having null or nan value for feature : 356 Setting mean values in rows having null or nan value for feature : 357 Dropping feature : 358 Setting mean values in rows having null or nan value for feature : 362 Setting mean values in rows having null or nan value for feature : 363 Setting mean values in rows having null or nan value for feature : 364 Setting mean values in rows having null or nan value for feature : 365 Setting mean values in rows having null or nan value for feature : 366 Setting mean values in rows having null or nan value for feature : 367 Setting mean values in rows having null or nan value for feature : 368 Setting mean values in rows having null or nan value for feature : 369 Setting mean values in rows having null or nan value for feature : 370 Setting mean values in rows having null or nan value for feature : 371 Setting mean values in rows having null or nan value for feature : 372 Setting mean values in rows having null or nan value for feature : 373 Setting mean values in rows having null or nan value for feature : 374 Setting mean values in rows having null or nan value for feature : 375 Setting mean values in rows having null or nan value for feature : 376 Setting mean values in rows having null or nan value for feature : 377 Setting 
mean values in rows having null or nan value for feature : 378 Setting mean values in rows having null or nan value for feature : 379 Setting mean values in rows having null or nan value for feature : 380 Setting mean values in rows having null or nan value for feature : 381 Dropping feature : 382 Dropping feature : 383 Dropping feature : 384 Dropping feature : 385 Setting mean values in rows having null or nan value for feature : 391 Setting mean values in rows having null or nan value for feature : 394 Setting mean values in rows having null or nan value for feature : 395 Setting mean values in rows having null or nan value for feature : 396 Setting mean values in rows having null or nan value for feature : 397 Setting mean values in rows having null or nan value for feature : 398 Setting mean values in rows having null or nan value for feature : 399 Setting mean values in rows having null or nan value for feature : 400 Setting mean values in rows having null or nan value for feature : 401 Setting mean values in rows having null or nan value for feature : 402 Setting mean values in rows having null or nan value for feature : 403 Setting mean values in rows having null or nan value for feature : 404 Setting mean values in rows having null or nan value for feature : 405 Setting mean values in rows having null or nan value for feature : 406 Setting mean values in rows having null or nan value for feature : 407 Setting mean values in rows having null or nan value for feature : 408 Setting mean values in rows having null or nan value for feature : 409 Setting mean values in rows having null or nan value for feature : 410 Setting mean values in rows having null or nan value for feature : 411 Setting mean values in rows having null or nan value for feature : 412 Setting mean values in rows having null or nan value for feature : 413 Setting mean values in rows having null or nan value for feature : 414 Setting mean values in rows having null or nan value for feature : 
415 Setting mean values in rows having null or nan value for feature : 416 Setting mean values in rows having null or nan value for feature : 417 Setting mean values in rows having null or nan value for feature : 418 Setting mean values in rows having null or nan value for feature : 419 Setting mean values in rows having null or nan value for feature : 420 Setting mean values in rows having null or nan value for feature : 421 Setting mean values in rows having null or nan value for feature : 422 Setting mean values in rows having null or nan value for feature : 423 Setting mean values in rows having null or nan value for feature : 424 Setting mean values in rows having null or nan value for feature : 425 Setting mean values in rows having null or nan value for feature : 426 Setting mean values in rows having null or nan value for feature : 427 Setting mean values in rows having null or nan value for feature : 428 Setting mean values in rows having null or nan value for feature : 430 Setting mean values in rows having null or nan value for feature : 431 Setting mean values in rows having null or nan value for feature : 432 Setting mean values in rows having null or nan value for feature : 433 Setting mean values in rows having null or nan value for feature : 434 Setting mean values in rows having null or nan value for feature : 435 Setting mean values in rows having null or nan value for feature : 436 Setting mean values in rows having null or nan value for feature : 437 Setting mean values in rows having null or nan value for feature : 438 Setting mean values in rows having null or nan value for feature : 439 Setting mean values in rows having null or nan value for feature : 440 Setting mean values in rows having null or nan value for feature : 441 Setting mean values in rows having null or nan value for feature : 442 Setting mean values in rows having null or nan value for feature : 443 Setting mean values in rows having null or nan value for feature : 444 Setting 
mean values in rows having null or nan value for feature : 445 Setting mean values in rows having null or nan value for feature : 446 Setting mean values in rows having null or nan value for feature : 447 Setting mean values in rows having null or nan value for feature : 448 Setting mean values in rows having null or nan value for feature : 449 Setting mean values in rows having null or nan value for feature : 450 Setting mean values in rows having null or nan value for feature : 451 Setting mean values in rows having null or nan value for feature : 452 Setting mean values in rows having null or nan value for feature : 453 Setting mean values in rows having null or nan value for feature : 454 Setting mean values in rows having null or nan value for feature : 455 Setting mean values in rows having null or nan value for feature : 456 Setting mean values in rows having null or nan value for feature : 457 Setting mean values in rows having null or nan value for feature : 458 Setting mean values in rows having null or nan value for feature : 459 Setting mean values in rows having null or nan value for feature : 460 Setting mean values in rows having null or nan value for feature : 461 Setting mean values in rows having null or nan value for feature : 462 Setting mean values in rows having null or nan value for feature : 463 Setting mean values in rows having null or nan value for feature : 464 Setting mean values in rows having null or nan value for feature : 465 Setting mean values in rows having null or nan value for feature : 466 Setting mean values in rows having null or nan value for feature : 467 Setting mean values in rows having null or nan value for feature : 468 Setting mean values in rows having null or nan value for feature : 469 Setting mean values in rows having null or nan value for feature : 470 Setting mean values in rows having null or nan value for feature : 471 Setting mean values in rows having null or nan value for feature : 472 Setting mean values 
in rows having null or nan value for feature : 473 Setting mean values in rows having null or nan value for feature : 474 Setting mean values in rows having null or nan value for feature : 475 Setting mean values in rows having null or nan value for feature : 476 Setting mean values in rows having null or nan value for feature : 477 Setting mean values in rows having null or nan value for feature : 478 Setting mean values in rows having null or nan value for feature : 479 Setting mean values in rows having null or nan value for feature : 480 Setting mean values in rows having null or nan value for feature : 481 Setting mean values in rows having null or nan value for feature : 482 Setting mean values in rows having null or nan value for feature : 483 Setting mean values in rows having null or nan value for feature : 484 Setting mean values in rows having null or nan value for feature : 485 Setting mean values in rows having null or nan value for feature : 486 Setting mean values in rows having null or nan value for feature : 487 Setting mean values in rows having null or nan value for feature : 488 Setting mean values in rows having null or nan value for feature : 489 Setting mean values in rows having null or nan value for feature : 490 Setting mean values in rows having null or nan value for feature : 491 Dropping feature : 492 Setting mean values in rows having null or nan value for feature : 496 Setting mean values in rows having null or nan value for feature : 497 Setting mean values in rows having null or nan value for feature : 498 Setting mean values in rows having null or nan value for feature : 499 Setting mean values in rows having null or nan value for feature : 500 Setting mean values in rows having null or nan value for feature : 501 Setting mean values in rows having null or nan value for feature : 502 Setting mean values in rows having null or nan value for feature : 503 Setting mean values in rows having null or nan value for feature : 504 Setting 
mean values in rows having null or nan value for feature : 505 Setting mean values in rows having null or nan value for feature : 506 Setting mean values in rows having null or nan value for feature : 507 Setting mean values in rows having null or nan value for feature : 508 Setting mean values in rows having null or nan value for feature : 509 Setting mean values in rows having null or nan value for feature : 510 Setting mean values in rows having null or nan value for feature : 511 Setting mean values in rows having null or nan value for feature : 512 Setting mean values in rows having null or nan value for feature : 513 Setting mean values in rows having null or nan value for feature : 514 Setting mean values in rows having null or nan value for feature : 515 Dropping feature : 516 Dropping feature : 517 Dropping feature : 518 Dropping feature : 519 Setting mean values in rows having null or nan value for feature : 525 Setting mean values in rows having null or nan value for feature : 528 Setting mean values in rows having null or nan value for feature : 529 Setting mean values in rows having null or nan value for feature : 530 Setting mean values in rows having null or nan value for feature : 531 Setting mean values in rows having null or nan value for feature : 532 Setting mean values in rows having null or nan value for feature : 533 Setting mean values in rows having null or nan value for feature : 534 Setting mean values in rows having null or nan value for feature : 535 Setting mean values in rows having null or nan value for feature : 536 Setting mean values in rows having null or nan value for feature : 537 Setting mean values in rows having null or nan value for feature : 538 Setting mean values in rows having null or nan value for feature : 539 Setting mean values in rows having null or nan value for feature : 540 Setting mean values in rows having null or nan value for feature : 541 Setting mean values in rows having null or nan value for feature : 
542 Setting mean values in rows having null or nan value for feature : 543 Setting mean values in rows having null or nan value for feature : 544 Setting mean values in rows having null or nan value for feature : 545 Setting mean values in rows having null or nan value for feature : 546 Setting mean values in rows having null or nan value for feature : 547 Setting mean values in rows having null or nan value for feature : 548 Setting mean values in rows having null or nan value for feature : 549 Setting mean values in rows having null or nan value for feature : 550 Setting mean values in rows having null or nan value for feature : 551 Setting mean values in rows having null or nan value for feature : 552 Setting mean values in rows having null or nan value for feature : 553 Setting mean values in rows having null or nan value for feature : 554 Setting mean values in rows having null or nan value for feature : 555 Setting mean values in rows having null or nan value for feature : 556 Setting mean values in rows having null or nan value for feature : 557 Setting mean values in rows having null or nan value for feature : 558 Setting mean values in rows having null or nan value for feature : 559 Setting mean values in rows having null or nan value for feature : 560 Setting mean values in rows having null or nan value for feature : 561 Setting mean values in rows having null or nan value for feature : 562 Setting mean values in rows having null or nan value for feature : 563 Setting mean values in rows having null or nan value for feature : 564 Setting mean values in rows having null or nan value for feature : 565 Setting mean values in rows having null or nan value for feature : 566 Setting mean values in rows having null or nan value for feature : 567 Setting mean values in rows having null or nan value for feature : 568 Setting mean values in rows having null or nan value for feature : 569 Dropping feature : 578 Dropping feature : 579 Dropping feature : 580 Dropping 
feature : 581 Setting mean values in rows having null or nan value for feature : 582 Setting mean values in rows having null or nan value for feature : 583 Setting mean values in rows having null or nan value for feature : 584 Setting mean values in rows having null or nan value for feature : 585 Setting mean values in rows having null or nan value for feature : 586 Setting mean values in rows having null or nan value for feature : 587 Setting mean values in rows having null or nan value for feature : 588 Setting mean values in rows having null or nan value for feature : 589
signals_df_cp.shape
(1567, 560)
Q 2B. Identify and drop the features which have the same value in all rows.
Ans 2B.
# Get the column list from the cleaned-up dataframe
collist_cln = signals_df_cp.columns
# Iterate over the columns
for colnm in collist_cln:
    # Drop the feature if it holds a single unique value
    if ( len( signals_df_cp[colnm].value_counts() ) == 1 ) :
        print( 'Dropping feature holding same value in all rows : ', colnm )
        signals_df_cp = signals_df_cp.drop( [colnm], axis = 1 )
Dropping feature holding same value in all rows : 5 Dropping feature holding same value in all rows : 13 Dropping feature holding same value in all rows : 42 Dropping feature holding same value in all rows : 49 Dropping feature holding same value in all rows : 52 Dropping feature holding same value in all rows : 69 Dropping feature holding same value in all rows : 97 Dropping feature holding same value in all rows : 141 Dropping feature holding same value in all rows : 149 Dropping feature holding same value in all rows : 178 Dropping feature holding same value in all rows : 179 Dropping feature holding same value in all rows : 186 Dropping feature holding same value in all rows : 189 Dropping feature holding same value in all rows : 190 Dropping feature holding same value in all rows : 191 Dropping feature holding same value in all rows : 192 Dropping feature holding same value in all rows : 193 Dropping feature holding same value in all rows : 194 Dropping feature holding same value in all rows : 226 Dropping feature holding same value in all rows : 229 Dropping feature holding same value in all rows : 230 Dropping feature holding same value in all rows : 231 Dropping feature holding same value in all rows : 232 Dropping feature holding same value in all rows : 233 Dropping feature holding same value in all rows : 234 Dropping feature holding same value in all rows : 235 Dropping feature holding same value in all rows : 236 Dropping feature holding same value in all rows : 237 Dropping feature holding same value in all rows : 240 Dropping feature holding same value in all rows : 241 Dropping feature holding same value in all rows : 242 Dropping feature holding same value in all rows : 243 Dropping feature holding same value in all rows : 256 Dropping feature holding same value in all rows : 257 Dropping feature holding same value in all rows : 258 Dropping feature holding same value in all rows : 259 Dropping feature holding same value in all rows : 260 Dropping 
feature holding same value in all rows : 261 Dropping feature holding same value in all rows : 262 Dropping feature holding same value in all rows : 263 Dropping feature holding same value in all rows : 264 Dropping feature holding same value in all rows : 265 Dropping feature holding same value in all rows : 266 Dropping feature holding same value in all rows : 276 Dropping feature holding same value in all rows : 284 Dropping feature holding same value in all rows : 313 Dropping feature holding same value in all rows : 314 Dropping feature holding same value in all rows : 315 Dropping feature holding same value in all rows : 322 Dropping feature holding same value in all rows : 325 Dropping feature holding same value in all rows : 326 Dropping feature holding same value in all rows : 327 Dropping feature holding same value in all rows : 328 Dropping feature holding same value in all rows : 329 Dropping feature holding same value in all rows : 330 Dropping feature holding same value in all rows : 364 Dropping feature holding same value in all rows : 369 Dropping feature holding same value in all rows : 370 Dropping feature holding same value in all rows : 371 Dropping feature holding same value in all rows : 372 Dropping feature holding same value in all rows : 373 Dropping feature holding same value in all rows : 374 Dropping feature holding same value in all rows : 375 Dropping feature holding same value in all rows : 378 Dropping feature holding same value in all rows : 379 Dropping feature holding same value in all rows : 380 Dropping feature holding same value in all rows : 381 Dropping feature holding same value in all rows : 394 Dropping feature holding same value in all rows : 395 Dropping feature holding same value in all rows : 396 Dropping feature holding same value in all rows : 397 Dropping feature holding same value in all rows : 398 Dropping feature holding same value in all rows : 399 Dropping feature holding same value in all rows : 400 Dropping 
feature holding same value in all rows : 401 Dropping feature holding same value in all rows : 402 Dropping feature holding same value in all rows : 403 Dropping feature holding same value in all rows : 404 Dropping feature holding same value in all rows : 414 Dropping feature holding same value in all rows : 422 Dropping feature holding same value in all rows : 449 Dropping feature holding same value in all rows : 450 Dropping feature holding same value in all rows : 451 Dropping feature holding same value in all rows : 458 Dropping feature holding same value in all rows : 461 Dropping feature holding same value in all rows : 462 Dropping feature holding same value in all rows : 463 Dropping feature holding same value in all rows : 464 Dropping feature holding same value in all rows : 465 Dropping feature holding same value in all rows : 466 Dropping feature holding same value in all rows : 481 Dropping feature holding same value in all rows : 498 Dropping feature holding same value in all rows : 501 Dropping feature holding same value in all rows : 502 Dropping feature holding same value in all rows : 503 Dropping feature holding same value in all rows : 504 Dropping feature holding same value in all rows : 505 Dropping feature holding same value in all rows : 506 Dropping feature holding same value in all rows : 507 Dropping feature holding same value in all rows : 508 Dropping feature holding same value in all rows : 509 Dropping feature holding same value in all rows : 512 Dropping feature holding same value in all rows : 513 Dropping feature holding same value in all rows : 514 Dropping feature holding same value in all rows : 515 Dropping feature holding same value in all rows : 528 Dropping feature holding same value in all rows : 529 Dropping feature holding same value in all rows : 530 Dropping feature holding same value in all rows : 531 Dropping feature holding same value in all rows : 532 Dropping feature holding same value in all rows : 533 Dropping 
feature holding same value in all rows : 534 Dropping feature holding same value in all rows : 535 Dropping feature holding same value in all rows : 536 Dropping feature holding same value in all rows : 537 Dropping feature holding same value in all rows : 538
signals_df_cp.shape
(1567, 444)
Q 2C. Drop other features if required using relevant functional knowledge. Clearly justify the same.
Ans 2C
#signals_df_cp.to_csv("FMT_Project\signal_data_cleaned.csv")
droplst = ['74', '114', '206', '209', '249', '342', '347', '387', '478', '521']
for colnm in droplst:
    signals_df_cp = signals_df_cp.drop( [colnm], axis = 1 )
signals_df_cp.shape
(1567, 434)
Some of the columns hold the same value in about 98% of the rows, so their variance is too low to influence the prediction or its accuracy. These columns were identified and dropped.
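The 98% criterion can also be checked programmatically instead of by inspecting the data manually. The sketch below assumes a hypothetical helper `near_constant_columns` and a 0.98 threshold; neither is part of the notebook, and the toy frame only stands in for `signals_df_cp`.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.98):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most common value (NaN counts as a value)
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data standing in for signals_df_cp: 'a' is 99% zeros, 'b' varies on every row
toy = pd.DataFrame({
    'a': [0.0] * 99 + [1.0],
    'b': [float(i) for i in range(100)],
})
to_drop = near_constant_columns(toy)
print(to_drop)
```

The returned list could then be passed to `drop(columns=...)` in place of a hand-written column list.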
Q 2D. Check for multi-collinearity in the data and take necessary action.
Ans 2D. Compute the VIF (variance inflation factor) of each feature to detect multi-collinearity and remove the features with a high VIF value.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Remove target feature and create temp dataframe for VIF
vif_temp_df = signals_df_cp
vif_temp_df = vif_temp_df.drop( ['Pass/Fail'], axis = 1 )
vif_temp_df = vif_temp_df.drop( ['Time'], axis = 1 )
#vif_temp_df = vif_temp_df[~vif_temp_df.isin([np.nan, np.inf, -np.inf]).any(1)]
vif_data = pd.DataFrame()
vif_data["feature_name"] = vif_temp_df.columns
vif_data["VIF_value"] = [variance_inflation_factor(vif_temp_df.values, i)
                         for i in range(len(vif_temp_df.columns))]
print(vif_data)
    feature_name      VIF_value
0              0   24158.000431
1              1    9533.835746
2              2  152823.076949
3              3     127.010069
4              4   42400.955890
..           ...            ...
427          585   18342.454732
428          586       9.662838
429          587     130.393833
430          588     124.658955
431          589       5.792604

[432 rows x 2 columns]
# Use a list to capture the high-VIF columns
high_vif_feature = []
# Iterate through the VIF dataframe and add the features that have a VIF value greater than 5
for vif_idx in vif_data.index:
    if( vif_data['VIF_value'][vif_idx] > 5 ) :
        high_vif_feature.append( vif_data['feature_name'][vif_idx] )
for colmn_nm in high_vif_feature:
    signals_df_cp = signals_df_cp.drop( [ colmn_nm ], axis = 1 )
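Dropping every feature with VIF above 5 in one pass can remove more than is needed, because the VIF of a feature falls once its collinear partner is gone. A common alternative is to drop the single worst feature and recompute. The sketch below is only an illustration under stated assumptions: `vif_series` and `drop_high_vif` are hypothetical helpers, the data is synthetic, and VIF is taken as the diagonal of the inverse correlation matrix (the definition with an intercept, so the values will differ from those printed above, which were computed by statsmodels without an added constant). It also assumes no two columns are exact duplicates.

```python
import numpy as np
import pandas as pd

def vif_series(df):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)

def drop_high_vif(df, threshold=5.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    kept = df.copy()
    while kept.shape[1] > 1:
        vifs = vif_series(kept)
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            break
        kept = kept.drop(columns=[worst])
    return kept

# Synthetic data: 'x1_copy' nearly duplicates 'x1', 'x2' is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
toy = pd.DataFrame({
    'x1': x1,
    'x1_copy': x1 + rng.normal(scale=0.01, size=500),
    'x2': x2,
})
reduced = drop_high_vif(toy)
print(list(reduced.columns))
```

With the one-pass rule both `x1` and `x1_copy` would be removed; the iterative version keeps one of them together with `x2`.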
Q 2E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions.
Ans 2E
signals_df_cp.isnull().sum()
Time         0
9            0
10           0
24           0
75           0
77           0
78           0
79           0
80           0
82           0
91           0
95           0
102          0
107          0
108          0
129          0
418          0
419          0
432          0
433          0
468          0
482          0
483          0
484          0
485          0
486          0
487          0
488          0
489          0
499          0
500          0
511          0
Pass/Fail    0
dtype: int64
signals_df_cp.dtypes
Time          object
9            float64
10           float64
24           float64
75           float64
77           float64
78           float64
79           float64
80           float64
82           float64
91           float64
95           float64
102          float64
107          float64
108          float64
129          float64
418          float64
419          float64
432          float64
433          float64
468          float64
482          float64
483          float64
484          float64
485          float64
486          float64
487          float64
488          float64
489          float64
499          float64
500          float64
511          float64
Pass/Fail      int64
dtype: object
# Drop the Time feature as it does not help in prediction
signals_df_cp = signals_df_cp.drop( [ 'Time' ], axis = 1 )
signals_df_cp.head()
| | 9 | 10 | 24 | 75 | 77 | 78 | 79 | 80 | 82 | 91 | ... | 484 | 485 | 486 | 487 | 488 | 489 | 499 | 500 | 511 | Pass/Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0162 | -0.0034 | 751.00 | 0.0126 | 0.0141 | -0.0307 | -0.0083 | -0.0026 | -0.0044 | -0.3274 | ... | 494.6996 | 178.1759 | 843.1138 | 0.0000 | 53.1098 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -1 |
| 1 | -0.0005 | -0.0148 | -1640.25 | -0.0039 | 0.0004 | -0.0440 | -0.0358 | -0.0120 | 0.0017 | 0.1455 | ... | 0.0000 | 359.0444 | 130.6350 | 820.7900 | 194.4371 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -1 |
| 2 | 0.0041 | 0.0013 | -1916.50 | -0.0078 | -0.0052 | 0.0213 | -0.0054 | -0.1134 | 0.0287 | 0.0553 | ... | 0.0000 | 190.3869 | 746.9150 | 74.0741 | 191.7582 | 250.1742 | 0.0000 | 0.0000 | 244.2748 | 1 |
| 3 | -0.0124 | -0.0033 | -1657.25 | -0.0555 | -0.0400 | 0.0400 | 0.0676 | -0.1051 | 0.0277 | 0.0697 | ... | 305.7500 | 88.5553 | 104.6660 | 71.7583 | 0.0000 | 336.7660 | 0.0000 | 711.6418 | 0.0000 | -1 |
| 4 | -0.0031 | -0.0072 | 117.00 | -0.0534 | -0.0167 | -0.0449 | 0.0034 | -0.0178 | -0.0048 | 0.0448 | ... | 461.8619 | 240.1781 | 0.0000 | 587.3773 | 748.1781 | 0.0000 | 293.1396 | 0.0000 | 0.0000 | -1 |
5 rows × 32 columns
#signals_df_cp.to_csv("FMT_Project\signal_data_cleaned1.csv")
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# Copy all the predictor variables into X dataframe. Since 'Pass/Fail' is dependent variable drop it
X = signals_df_cp.drop('Pass/Fail', axis=1)
# Copy the 'Pass/Fail' column alone into the y dataframe. This is the dependent variable
y = signals_df_cp[['Pass/Fail']]
#Standardize the data
X_scaled = preprocessing.scale(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
y_scaled = preprocessing.scale(y)
y_scaled = pd.DataFrame(y_scaled, columns=y.columns)
#Split the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.30, random_state=1)
#Use the Ridge shrinkage model to identify the unwanted features
ridge = Ridge(alpha=.3)
ridge.fit(X_train,y_train)
print ("Ridge model:", (ridge.coef_))
Ridge model: [[-0.03101568 0.04150655 -0.03900849 0.0605521 0.00857828 -0.02664311 0.01097872 -0.02373694 0.03078095 0.00963695 0.03897061 -0.05069726 -0.01498224 0.02894236 0.08538543 -0.02674625 -0.02127135 -0.00266133 0.05794623 -0.0468209 -0.01113797 -0.00090975 -0.03678205 -0.06164721 0.01730098 -0.02669777 -0.05168917 0.02447552 -0.0220395 0.06025086 0.04317285]]
# Drop the features identified through Ridge that will not help the target prediction
# (here: every feature whose Ridge coefficient is negative)
len(ridge.coef_)
colm_nm_del = signals_df_cp.columns
for i in range(len(ridge.coef_)):
    for j in range(len(ridge.coef_[i])):
        if( ridge.coef_[i][j] < 0) :
            signals_df_cp = signals_df_cp.drop([colm_nm_del[j]], axis = 1)
print(signals_df_cp.columns)
Index(['10', '75', '77', '79', '82', '91', '95', '108', '129', '433', '486',
'489', '500', '511', 'Pass/Fail'],
dtype='object')
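The sign of a Ridge coefficient only gives the direction of the association. On standardised features it is the magnitude that is usually read as relevance. As a hedged alternative to the sign rule, the sketch below ranks features by absolute coefficient, using the first six coefficients printed above for columns '9' to '78'. The 0.03 cut-off is a hypothetical value chosen only for illustration.

```python
import numpy as np

# First six feature names and Ridge coefficients from the output above
names = np.array(['9', '10', '24', '75', '77', '78'])
coefs = np.array([-0.03101568, 0.04150655, -0.03900849,
                  0.0605521, 0.00857828, -0.02664311])

threshold = 0.03  # hypothetical cut-off, not taken from the notebook
keep_mask = np.abs(coefs) >= threshold

kept = names[keep_mask].tolist()
dropped = names[~keep_mask].tolist()
print('kept   :', kept)
print('dropped:', dropped)
```

Under the sign rule, features '9' and '24' are dropped while '77' is kept, although '77' has the smallest coefficient magnitude of the six.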
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
print(regression_model.score(X_train, y_train))
print(regression_model.score(X_test, y_test))
0.04313541843738511 0.01125964294920545
print(ridge.score(X_train, y_train))
print(ridge.score(X_test, y_test))
0.04313541448940017 0.011268407852808804
3. Data analysis & visualisation
Q 3A Perform a detailed univariate Analysis with appropriate detailed comments after each analysis
Ans 3A.
#Choosing feature '75' for univariate analysis
uv_analys_df = signals_df_cp['75']
# Use a histogram with 10 bins to see how the values are spread
plt.hist(uv_analys_df, bins=10)
(array([ 18., 87., 759., 661., 39., 1., 1., 0., 0., 1.]),
array([-0.1049 , -0.07126, -0.03762, -0.00398, 0.02966, 0.0633 ,
0.09694, 0.13058, 0.16422, 0.19786, 0.2315 ]),
<BarContainer object of 10 artists>)
# Use the distplot to find the distribution
sns.distplot( uv_analys_df )
<AxesSubplot:xlabel='75', ylabel='Density'>
sns.distplot(uv_analys_df, hist=False)
<AxesSubplot:xlabel='75', ylabel='Density'>
# Use the histogram chart and plot the median, mean and mode to identify how they are placed
plt.hist(uv_analys_df, color='b')
plt.axvline(uv_analys_df.mean(), color='g', linewidth=1)
plt.axvline(uv_analys_df.median(), color='y', linestyle='dashed', linewidth=1)
plt.axvline(uv_analys_df.mode()[0], color='w', linestyle='dashed', linewidth=1)
<matplotlib.lines.Line2D at 0x1a8713ef970>
The distribution of feature '75' is close to a Gaussian (normal) distribution. Most of the observations lie between about -0.038 and 0.030, i.e. bins 3 and 4 hold most of the values (759 and 661 of the 1567 rows). The mean, median and mode are all placed very close together in the distribution.
Q 3B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis.
Ans 3B.
#Use the pairplot to find the correlation pattern
sns.pairplot(signals_df_cp)
<seaborn.axisgrid.PairGrid at 0x1a8007a3fa0>
# Use the pairplot for selective features to find the distribution of target feature classification
plot_fig = sns.pairplot( signals_df_cp, vars = ['91', '75','77', '79'], hue='Pass/Fail', palette='colorblind' )
plot_fig.fig.set_size_inches(16,16)
# Plot a heat map to find the correlation of the data
plt.figure(figsize=(10,5))
sns.heatmap(signals_df_cp.corr(), annot=True, linewidths=.5, fmt= '.1f', center = 1 )
plt.show()
Very little correlation exists between the features and the target variable. The data is imbalanced: far more rows carry the label -1 (Pass, per the data description) than the label 1 (Fail).
4. Data pre-processing:
Q 4A. Segregate predictors vs target attributes.
Ans 4A.
# Copy all the predictor variables into X dataframe. Since 'Pass/Fail' is target variable hence drop it
X_Pred = signals_df_cp.drop('Pass/Fail', axis=1)
# Copy the 'Pass/Fail' column alone into the y dataframe. This is the target variable
Y_Tgt = signals_df_cp[['Pass/Fail']]
Q 4B. Check for target balancing and fix it if found imbalanced
Ans 4B
# Find the unique values in the target variable and its count
Y_Tgt.value_counts()
Pass/Fail
-1           1463
 1            104
dtype: int64
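The class with value 1 (treated as Pass here, although the data description defines 1 as Fail and -1 as Pass) makes up less than 10% of the target column (104 of 1567 rows). To balance the classes, the target value of every alternate -1 row is changed to 1. To simplify classification, the remaining -1 values are changed to 0.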
temp_df = signals_df_cp  # same object as signals_df_cp, so the edits below also change signals_df_cp
row_count = 0
chg_tgt_val = True
# Iterate through the target feature and balance the data.
# Change alternate rows with a negative value to the positive class, and change the remaining negative values to zero
# to help the models perform classification better
for index, row in signals_df_cp.iterrows():
    if (row['Pass/Fail'] == -1 ):
        if(chg_tgt_val):
            temp_df.at[index,'Pass/Fail'] = 1
            row_count = row_count + 1
            chg_tgt_val = False
        else:
            temp_df.at[index,'Pass/Fail'] = 0
            chg_tgt_val = True
temp_df['Pass/Fail'].value_counts()
1    836
0    731
Name: Pass/Fail, dtype: int64
X = signals_df_cp.drop('Pass/Fail', axis=1)
Y = signals_df_cp[['Pass/Fail']]
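Relabelling alternate rows equalises the class counts, but it also changes which production entities belong to which class. For comparison, here is a sketch of a balancing step that keeps every original label: recode -1 as 0, then duplicate minority rows at random. The helper `random_oversample` and the toy arrays are assumptions made for illustration and are not part of the notebook. In practice such resampling is usually applied to the training split only.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority rows with replacement to fill the gap
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra_idx])
    return X[idx], y[idx]

# Toy target with the same -1/1 coding as the dataset: 90 rows of -1, 10 rows of 1
y_raw = np.array([-1] * 90 + [1] * 10)
X_raw = np.arange(200, dtype=float).reshape(100, 2)

# Recode -1 -> 0 without changing which rows are labelled 1
y01 = np.where(y_raw == -1, 0, 1)
X_bal, y_bal = random_oversample(X_raw, y01)
print(np.bincount(y_bal))
```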
Q 4C. Perform train-test split and standardise the data or vice versa if required.
Ans 4C
# Scale the predictor features to standardize the values
X_scaled = preprocessing.scale(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.30, random_state=2)
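The cell above scales the full predictor matrix before splitting, so the test rows contribute to the mean and standard deviation used for scaling. The question allows either order. A sketch of the split-first variant follows, where the scaler is fitted on the training rows only. The data is synthetic and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and a 0/1 target
rng = np.random.default_rng(2)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Split first, keeping the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=2, stratify=y_toy)

# Fit the scaler on the training rows only, then apply it to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```

The test set will not have exactly zero mean and unit variance after this transform, which is expected: it is scaled with the training statistics.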
Q 4D. Check if the train and test data have similar statistical characteristics when compared with original data.
Ans 4D.
# Plot the complete dataset in a heatmap
plt.figure(figsize=(10,5))
sns.heatmap(signals_df_cp.corr(), annot=True, linewidths=.5, fmt= '.1f', center = 1 )
plt.show()
# Plot the training dataset in a heatmap
plt.figure(figsize=(10,5))
sns.heatmap(X_train.corr(), annot=True, linewidths=.5, fmt= '.1f', center = 1 )
plt.show()
# Plot the complete test dataset in a heatmap
plt.figure(figsize=(10,5))
sns.heatmap(X_test.corr(), annot=True, linewidths=.5, fmt= '.1f', center = 1 )
plt.show()
The correlation structure of the train, test and original data is similar.
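Heatmaps compare the correlation structure. The per-feature mean and standard deviation can also be compared directly. The sketch below uses a hypothetical helper `compare_stats` on synthetic frames; in the notebook the three inputs would be the scaled predictors and their train and test parts.

```python
import numpy as np
import pandas as pd

def compare_stats(full, train, test):
    """Side-by-side mean and standard deviation of each column for the three sets."""
    parts = {'full': full, 'train': train, 'test': test}
    frames = {name: part.agg(['mean', 'std']).T for name, part in parts.items()}
    # Columns become a two-level index: (set name, statistic)
    return pd.concat(frames, axis=1)

# Synthetic stand-in data with a 70/30 random split
rng = np.random.default_rng(2)
full = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['f1', 'f2', 'f3'])
shuffled = full.sample(frac=1.0, random_state=2)
train, test = shuffled.iloc[:700], shuffled.iloc[700:]

summary = compare_stats(full, train, test)
print(summary.round(3))
```

Large gaps between the `train`, `test` and `full` columns for a feature would indicate that the split is not representative for that feature.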
5. Model training, testing and tuning:
Q 5A. Use any Supervised Learning technique to train a model.
Ans 5A. Use decision tree classifier
# Use the Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier(criterion = 'entropy' )
dt_model.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy')
dt_model.score(X_test , y_test)
0.505307855626327
# Print the confusion matrix
from sklearn import metrics
y_predict = dt_model.predict(X_test)
print(metrics.confusion_matrix(y_test, y_predict))
[[104 125]
 [108 134]]
Q 5B. Use cross validation techniques.
Ans 5B.
# Use KFold and cross_val_score to validate the model
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
kfold = KFold(n_splits=10, random_state=5, shuffle=True)
# Note: `y` was assigned in the Ridge step, before the target was rebalanced; the rebalanced target is `Y`
results = cross_val_score(dt_model,X, y, cv=kfold)
results
array([0.91082803, 0.87898089, 0.87261146, 0.89171975, 0.88535032,
0.87898089, 0.88535032, 0.87820513, 0.88461538, 0.8974359 ])
np.mean(abs(results))
0.8864078066307366
results.std()
0.010580987727327676
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
scores = cross_val_score(dt_model, X, y, cv=LeaveOneOut())
scores
array([1., 1., 1., ..., 1., 1., 1.])
scores.mean()
0.8768347160178686
scores.std()
0.3286268351850353
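On an imbalanced target, plain accuracy can look high simply because the majority class dominates. With the original coding, always predicting -1 would already score 1463/1567, about 0.93. Stratified folds and a class-aware metric give a fuller picture. The sketch below uses synthetic data from `make_classification`; the sample size, class weights and seeds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 93% of rows in class 0
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.93, 0.07], random_state=5)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)
tree = DecisionTreeClassifier(criterion='entropy', random_state=5)

acc = cross_val_score(tree, X_toy, y_toy, cv=cv, scoring='accuracy')
bal_acc = cross_val_score(tree, X_toy, y_toy, cv=cv, scoring='balanced_accuracy')

print('accuracy          :', acc.mean())
print('balanced accuracy :', bal_acc.mean())
```

Balanced accuracy averages the recall of each class, so a model that ignores the minority class scores about 0.5 on it regardless of how large the majority is.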
Q 5C. Apply hyper-parameter tuning techniques to get the best accuracy.
Ans 5C
# Use grid search to find the optimum parameters
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, KFold
dt_model1 = DecisionTreeClassifier()
params = { 'criterion':['gini','entropy'],'max_depth': np.arange(3, 15)}
# Accuracy is the appropriate score for a classifier; with 0/1 labels it ranks candidates the same way as negative MSE
gridSrch=GridSearchCV(dt_model1, cv=2, param_grid=params, scoring='accuracy')
# Train the model
gridSrch.fit(X_train, y_train)
print('Best Estimator from grid search : ' , gridSrch.best_estimator_)
Best Estimator from grid search : DecisionTreeClassifier(criterion='entropy', max_depth=3)
Q 5D. Use any other technique/method which can enhance the model performance.
Ans 5D
# Use the Random Forest classifier to try to enhance the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=2,n_estimators=500,criterion="entropy",random_state=10)
rfm=rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)
print(rf.score(X_test, y_test))
print(metrics.confusion_matrix(y_test, y_predict))
0.505307855626327
[[ 70 159]
 [ 74 168]]
# Use over sampling (SMOTE) to increase the number of minority-class rows in the training set
from sklearn.metrics import recall_score
from imblearn.over_sampling import SMOTE
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
X_train_res.shape
(1188, 14)
# Train the model using over sampled training data
rfm=rf.fit(X_train_res, y_train_res)
y_predict = rf.predict(X_test)
print(rf.score(X_test, y_test))
print(metrics.confusion_matrix(y_test, y_predict))
0.5031847133757962
[[ 96 133]
 [101 141]]
# Use under sampling to check the model performance
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy='majority')
X_rus, y_rus = rus.fit_resample(X_train, y_train)
y_rus
| | Pass/Fail |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 999 | 1 |
| 1000 | 1 |
| 1001 | 1 |
| 1002 | 1 |
| 1003 | 1 |
1004 rows × 1 columns
# Train the model using under sample dataset
rfm=rf.fit(X_rus, y_rus)
y_predict = rf.predict(X_test)
print(rf.score(X_test, y_test))
print(metrics.confusion_matrix(y_test, y_predict))
0.4585987261146497
[[100 129]
 [126 116]]
Q 5E. Display and explain the classification report in detail.
Ans 5E
print(metrics.classification_report(y_test, y_predict))
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.44 | 0.44 | 0.44 | 229 |
| 1 | 0.47 | 0.48 | 0.48 | 242 |
| accuracy | | | 0.46 | 471 |
| macro avg | 0.46 | 0.46 | 0.46 | 471 |
| weighted avg | 0.46 | 0.46 | 0.46 | 471 |
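The figures in the report can be traced back to the confusion matrix printed for the under-sampled random forest. Precision is the share of rows predicted as a class that truly belong to it, recall is the share of rows of a class that were predicted correctly, and the f1-score is their harmonic mean. The sketch below recomputes them in plain Python, assuming scikit-learn's convention of true classes in rows and predicted classes in columns; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix recorded above for the under-sampled random forest
# (rows = true class, columns = predicted class)
cm = [[100, 129],
      [126, 116]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column total: rows predicted as cls
    actual = sum(cm[cls])                    # row total: rows that truly are cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f1 = per_class_metrics(cm, cls)
    print(cls, round(p, 2), round(r, 2), round(f1, 2))

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print('accuracy', round(accuracy, 2))
```

All values sit close to 0.5 on a test set with 229 and 242 rows per class, so this model separates the two classes little better than guessing would.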
Q 5F. Apply the above steps for all possible models that you have learnt so far.
Ans 5F. Train Logistic Regression, KNN, Random Forest, Decision Tree and SVM models on the original, over-sampled and under-sampled training sets, and score each one on the same test set.
from sklearn.linear_model import LogisticRegression
logreg_model = LogisticRegression()
# Train the model with the original training data
logreg_model.fit(X_train, y_train)
model_trn_score = logreg_model.score(X_train, y_train)
model_tst_score = logreg_model.score(X_test, y_test)
# score() returns accuracy as a fraction between 0 and 1, so no '%' sign is printed
print('Logistic Regression Normal Training Score: {:.2f}'.format(model_trn_score))
print('Logistic Regression Normal Testing Score: {:.2f}'.format(model_tst_score))
# Train the model with over-sampled data
logreg_model.fit(X_train_res, y_train_res)
model_trn_score = logreg_model.score(X_train_res, y_train_res)
model_tst_score = logreg_model.score(X_test, y_test)
print('Logistic Regression Over sampling Training Score: {:.2f}'.format(model_trn_score))
print('Logistic Regression Over sampling Testing Score: {:.2f}'.format(model_tst_score))
# Train the model with under-sampled data
logreg_model.fit(X_rus, y_rus)
model_trn_score = logreg_model.score(X_rus, y_rus)
model_tst_score = logreg_model.score(X_test, y_test)
print('Logistic Regression Under sampling Training Score: {:.2f}'.format(model_trn_score))
print('Logistic Regression Under sampling Testing Score: {:.2f}'.format(model_tst_score))
Logistic Regression Normal Training Score: 0.56
Logistic Regression Normal Testing Score: 0.51
Logistic Regression Over sampling Training Score: 0.56
Logistic Regression Over sampling Testing Score: 0.49
Logistic Regression Under sampling Training Score: 0.55
Logistic Regression Under sampling Testing Score: 0.48
from sklearn.neighbors import KNeighborsClassifier
NNH = KNeighborsClassifier(n_neighbors=5, weights='distance')
# Train the model with the original training data
NNH.fit(X_train, y_train)
# Score the KNN model itself. The original cell called logreg_model.score for the
# training scores, so the training values recorded below belong to the logistic
# regression model and need a rerun; the testing values were computed with NNH.
NNH_trn_score = NNH.score(X_train, y_train)
NNH_tst_score = NNH.score(X_test, y_test)
print('KNNeighbor Normal Training Score: {:.2f}'.format(NNH_trn_score))
print('KNNeighbor Normal Testing Score: {:.2f}'.format(NNH_tst_score))
# Train the model with over-sampled data
NNH.fit(X_train_res, y_train_res)
NNH_trn_score = NNH.score(X_train_res, y_train_res)
NNH_tst_score = NNH.score(X_test, y_test)
print('KNNeighbor Over sampling Training Score: {:.2f}'.format(NNH_trn_score))
print('KNNeighbor Over sampling Testing Score: {:.2f}'.format(NNH_tst_score))
# Train the model with under-sampled data
NNH.fit(X_rus, y_rus)
NNH_trn_score = NNH.score(X_rus, y_rus)
NNH_tst_score = NNH.score(X_test, y_test)
print('KNNeighbor Under sampling Training Score: {:.2f}'.format(NNH_trn_score))
print('KNNeighbor Under sampling Testing Score: {:.2f}'.format(NNH_tst_score))
KNNeighbor Normal Training Score: 0.55
KNNeighbor Normal Testing Score: 0.49
KNNeighbor Over sampling Training Score: 0.56
KNNeighbor Over sampling Testing Score: 0.49
KNNeighbor Under sampling Training Score: 0.55
KNNeighbor Under sampling Testing Score: 0.49
rf = RandomForestClassifier(n_jobs=2, n_estimators=500, criterion="entropy", random_state=10)
# Train the model with the original training data
rfm = rf.fit(X_train, y_train)
rfm_trn_score = rf.score(X_train, y_train)
rfm_tst_score = rf.score(X_test, y_test)
# score() returns accuracy as a fraction between 0 and 1, so no '%' sign is printed
print('RandomForest Normal Training Score: {:.2f}'.format(rfm_trn_score))
print('RandomForest Normal Testing Score: {:.2f}'.format(rfm_tst_score))
# Train the model with over-sampled data
rfm = rf.fit(X_train_res, y_train_res)
rfm_trn_score = rf.score(X_train_res, y_train_res)
rfm_tst_score = rf.score(X_test, y_test)
print('RandomForest Over Sampling Training Score: {:.2f}'.format(rfm_trn_score))
print('RandomForest Over Sampling Testing Score: {:.2f}'.format(rfm_tst_score))
# Train the model with under-sampled data
rfm = rf.fit(X_rus, y_rus)
rfm_trn_score = rf.score(X_rus, y_rus)
rfm_tst_score = rf.score(X_test, y_test)
print('RandomForest Under Sampling Training Score: {:.2f}'.format(rfm_trn_score))
print('RandomForest Under Sampling Testing Score: {:.2f}'.format(rfm_tst_score))
RandomForest Normal Training Score: 1.00
RandomForest Normal Testing Score: 0.51
RandomForest Over Sampling Training Score: 1.00
RandomForest Over Sampling Testing Score: 0.50
RandomForest Under Sampling Training Score: 1.00
RandomForest Under Sampling Testing Score: 0.46
dt_model = DecisionTreeClassifier(criterion='entropy')
# Train the model with the original training data
dt_model.fit(X_train, y_train)
dt_trn_score = dt_model.score(X_train, y_train)
dt_tst_score = dt_model.score(X_test, y_test)
# score() returns accuracy as a fraction between 0 and 1, so no '%' sign is printed
print('Decision Tree Classifier Normal Training Score: {:.2f}'.format(dt_trn_score))
print('Decision Tree Classifier Normal Testing Score: {:.2f}'.format(dt_tst_score))
# Train the model with over-sampled data
dt_model.fit(X_train_res, y_train_res)
dt_trn_score = dt_model.score(X_train_res, y_train_res)
dt_tst_score = dt_model.score(X_test, y_test)
print('Decision Tree Classifier Over Sampling Training Score: {:.2f}'.format(dt_trn_score))
print('Decision Tree Classifier Over Sampling Testing Score: {:.2f}'.format(dt_tst_score))
# Train the model with under-sampled data
dt_model.fit(X_rus, y_rus)
dt_trn_score = dt_model.score(X_rus, y_rus)
dt_tst_score = dt_model.score(X_test, y_test)
print('Decision Tree Classifier Under Sampling Training Score: {:.2f}'.format(dt_trn_score))
print('Decision Tree Classifier Under Sampling Testing Score: {:.2f}'.format(dt_tst_score))
Decision Tree Classifier Normal Training Score: 1.00
Decision Tree Classifier Normal Testing Score: 0.53
Decision Tree Classifier Over Sampling Training Score: 1.00
Decision Tree Classifier Over Sampling Testing Score: 0.50
Decision Tree Classifier Under Sampling Training Score: 1.00
Decision Tree Classifier Under Sampling Testing Score: 0.46
from sklearn.svm import SVC
# gamma is ignored by the linear kernel; it only affects the rbf, poly and sigmoid kernels
svc_model = SVC(C=.1, kernel='linear', gamma=1)
# Train the model with the original training data
svc_model.fit(X_train, y_train)
svc_trn_score = svc_model.score(X_train, y_train)
svc_tst_score = svc_model.score(X_test, y_test)
# score() returns accuracy as a fraction between 0 and 1, so no '%' sign is printed
print('Support Vector Machine Normal Training Score: {:.2f}'.format(svc_trn_score))
print('Support Vector Machine Normal Testing Score: {:.2f}'.format(svc_tst_score))
# Train the model with over-sampled data
svc_model.fit(X_train_res, y_train_res)
svc_trn_score = svc_model.score(X_train_res, y_train_res)
svc_tst_score = svc_model.score(X_test, y_test)
print('Support Vector Machine Over sampling Training Score: {:.2f}'.format(svc_trn_score))
print('Support Vector Machine Over sampling Testing Score: {:.2f}'.format(svc_tst_score))
# Train the model with under-sampled data
svc_model.fit(X_rus, y_rus)
svc_trn_score = svc_model.score(X_rus, y_rus)
svc_tst_score = svc_model.score(X_test, y_test)
print('Support Vector Machine under sampling Training Score: {:.2f}'.format(svc_trn_score))
print('Support Vector Machine under sampling Testing Score: {:.2f}'.format(svc_tst_score))
Support Vector Machine Normal Training Score: 0.54
Support Vector Machine Normal Testing Score: 0.51
Support Vector Machine Over sampling Training Score: 0.57
Support Vector Machine Over sampling Testing Score: 0.49
Support Vector Machine under sampling Training Score: 0.55
Support Vector Machine under sampling Testing Score: 0.49
6. Post Training and Conclusion:
Q 6A. Display and compare all the models designed with their train and test accuracies.
Ans 6A.
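One way to put the recorded accuracies side by side is a small lookup table. The sketch below uses plain Python and copies the values from the outputs printed in section 5F; the names `results` and `best` are hypothetical. The KNN training accuracy is left as `None`, because those recorded values were computed with `logreg_model.score` rather than `NNH.score`.

```python
# (model, training data) -> (train accuracy, test accuracy), copied from the outputs above.
# KNN training accuracy is None: the recorded values came from logreg_model.score.
results = {
    ('Logistic Regression', 'Normal'): (0.56, 0.51),
    ('Logistic Regression', 'Over sampling'): (0.56, 0.49),
    ('Logistic Regression', 'Under sampling'): (0.55, 0.48),
    ('KNN', 'Normal'): (None, 0.49),
    ('KNN', 'Over sampling'): (None, 0.49),
    ('KNN', 'Under sampling'): (None, 0.49),
    ('Random Forest', 'Normal'): (1.00, 0.51),
    ('Random Forest', 'Over sampling'): (1.00, 0.50),
    ('Random Forest', 'Under sampling'): (1.00, 0.46),
    ('Decision Tree', 'Normal'): (1.00, 0.53),
    ('Decision Tree', 'Over sampling'): (1.00, 0.50),
    ('Decision Tree', 'Under sampling'): (1.00, 0.46),
    ('SVM', 'Normal'): (0.54, 0.51),
    ('SVM', 'Over sampling'): (0.57, 0.49),
    ('SVM', 'Under sampling'): (0.55, 0.49),
}

print('{:<22}{:<16}{:>7}{:>7}'.format('Model', 'Training data', 'Train', 'Test'))
for (model, data), (train_acc, test_acc) in results.items():
    train_txt = 'n/a' if train_acc is None else '{:.2f}'.format(train_acc)
    print('{:<22}{:<16}{:>7}{:>7.2f}'.format(model, data, train_txt, test_acc))

# Combination with the highest recorded test accuracy
best = max(results, key=lambda key: results[key][1])
print('Highest test accuracy:', best, results[best][1])
```

On these recorded numbers every model stays between 0.46 and 0.53 on the test set, while the random forest and the decision tree reach 1.00 on their training data, which points to overfitting rather than a real gain.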
Q 6B. Select the final best trained model along with your detailed comments for selecting this model.
Ans 6B.
Q 6C. Pickle the selected model for future use.
Ans 6C.
import pickle
# File name only, with no path or folder, so the file is written to the current working directory
DTC_pickle_filnm = 'DTC_pickle.pyl'
# At this point dt_model was last fitted on the under-sampled training data (X_rus, y_rus)
# The with-block opens the stream for writing and closes it automatically,
# even if pickle.dump raises an error
with open(DTC_pickle_filnm, 'wb') as OutputFile:
    # Write the model to the file for future use
    pickle.dump(dt_model, OutputFile)
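To reuse a pickled model later, it can be read back with `pickle.load`. The round trip below is a minimal sketch: a plain dictionary stands in for the fitted `dt_model`, and the temporary file name is hypothetical, so the snippet runs without the sensor data.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted dt_model; any picklable Python object behaves the same way
stand_in_model = {'criterion': 'entropy', 'note': 'placeholder for dt_model'}

# Hypothetical file location in a temporary folder, used only for this sketch
pickle_path = os.path.join(tempfile.mkdtemp(), 'model_roundtrip.pkl')

# Write the object to disk
with open(pickle_path, 'wb') as out_file:
    pickle.dump(stand_in_model, out_file)

# Read it back in a later session
with open(pickle_path, 'rb') as in_file:
    restored_model = pickle.load(in_file)

print(restored_model == stand_in_model)
```

With the real classifier, the restored object should expose the same `predict` and `score` methods. It is safest to load it with the same scikit-learn version that created it, and to unpickle only files from a trusted source.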
Q 6D. Write your conclusion on the results.
Ans 6D.